Linear and Quadratic Discriminant Analysis (LDA / QDA)#

Discriminant Analysis is a classic family of generative classifiers.

  • LDA assumes each class is Gaussian with a shared covariance → linear decision boundaries

  • QDA allows each class to have its own covariance → quadratic decision boundaries

Learning goals#

  • understand the generative story: \(p(x\mid y)\) + Bayes → \(p(y\mid x)\)

  • derive the discriminant score for LDA/QDA

  • build intuition for when the boundary becomes linear vs quadratic

  • implement LDA/QDA from scratch (NumPy)

  • use scikit-learn’s LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis

Table of contents#

  1. Generative vs discriminative intuition

  2. Gaussian class-conditional model

  3. QDA discriminant function (quadratic)

  4. LDA as a constrained QDA (linear)

  5. From-scratch LDA/QDA

  6. Visual decision boundaries (Plotly)

  7. When does QDA overfit?

  8. LDA as supervised dimensionality reduction

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)

1) Generative vs discriminative intuition#

Two worldviews:

Discriminative#

Learn \(p(y\mid x)\) directly (or a decision boundary).

Examples:

  • logistic regression

  • SVM

  • neural networks

Generative#

Model how data is generated:

\[p(x\mid y)\ \text{and}\ p(y)\]

Then use Bayes:

\[p(y\mid x) \propto p(x\mid y)\,p(y)\]

Examples:

  • Naive Bayes

  • LDA / QDA

Analogy:

Discriminative models are like a judge who learns where to draw the line. Generative models are like a novelist who learns how each class produces data, then uses that story to decide.
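A minimal numeric sketch of the Bayes step, using hypothetical 1D Gaussian classes (`gaussian_pdf` and all parameter values here are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two hypothetical 1D classes with equal priors
x_obs = 0.5
lik0 = gaussian_pdf(x_obs, mu=-1.0, sigma=1.0)  # p(x | y=0)
lik1 = gaussian_pdf(x_obs, mu=+1.0, sigma=1.0)  # p(x | y=1)
prior0 = prior1 = 0.5

# Bayes: p(y | x) ∝ p(x | y) p(y); normalize by the evidence
evidence = lik0 * prior0 + lik1 * prior1
post1 = lik1 * prior1 / evidence
print(f"p(y=1 | x={x_obs}) = {post1:.4f}")  # ≈ 0.7311
```

With equal class variances the posterior works out to an exact logistic function of \(x\), which already hints at why shared-covariance models give linear boundaries.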

2) Gaussian class-conditional model#

Assume that for each class \(k\):

\[ (x \mid y=k) \sim \mathcal{N}(\mu_k, \Sigma_k) \]

Then:

\[ \log p(x\mid y=k) = -\tfrac12\log|\Sigma_k| - \tfrac12 (x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) + \text{const} \]

Add the log prior \(\log \pi_k\) and you get a discriminant score.
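That score can be written directly in NumPy. A sketch with made-up 2D parameters (the shared \(-\tfrac{d}{2}\log 2\pi\) term is dropped, as in the formula above):

```python
import numpy as np

def log_score(x, mu, Sigma, prior):
    """log p(x | y=k) + log pi_k, up to the shared -d/2 * log(2*pi) term."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # stable log-determinant
    maha = diff @ np.linalg.solve(Sigma, diff)    # squared Mahalanobis distance
    return np.log(prior) - 0.5 * logdet - 0.5 * maha

# Hypothetical classes: different means, different covariance scales
x_query = np.array([1.0, 0.0])
score_a = log_score(x_query, np.array([0.0, 0.0]), np.eye(2), 0.5)
score_b = log_score(x_query, np.array([3.0, 0.0]), 2.0 * np.eye(2), 0.5)
print("scores:", score_a, score_b, "-> predict", "a" if score_a > score_b else "b")
```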

# A 2D dataset where classes have different covariance shapes (QDA-friendly)

n = 900
mu0 = np.array([-2.0, -1.0])
mu1 = np.array([+2.0, +1.0])

Sigma0 = np.array([[1.8, 0.8], [0.8, 1.0]])
Sigma1 = np.array([[1.0, -0.6], [-0.6, 1.6]])

X0 = rng.multivariate_normal(mu0, Sigma0, size=n // 2)
X1 = rng.multivariate_normal(mu1, Sigma1, size=n // 2)

X = np.vstack([X0, X1])
y = np.array([0] * (n // 2) + [1] * (n // 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

fig = px.scatter(
    x=X_tr[:, 0],
    y=X_tr[:, 1],
    color=y_tr.astype(str),
    title="Training set (different class covariances)",
    labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=6, opacity=0.65))
fig.update_layout(width=760, height=500)
fig.show()
def covariance_ellipse_points(mean: np.ndarray, cov: np.ndarray, n_points: int = 200, n_std: float = 2.0):
    """Return x,y points of an ellipse representing the covariance."""

    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    vals = vals[order]
    vecs = vecs[:, order]

    # ellipse axes lengths
    radii = n_std * np.sqrt(np.maximum(vals, 0.0))

    t = np.linspace(0, 2 * np.pi, n_points)
    circle = np.vstack([np.cos(t), np.sin(t)])
    ellipse = (vecs @ (radii[:, None] * circle)).T
    ellipse = ellipse + mean[None, :]
    return ellipse[:, 0], ellipse[:, 1]


# Estimate means/covariances from training data
mu0_hat = X_tr[y_tr == 0].mean(axis=0)
mu1_hat = X_tr[y_tr == 1].mean(axis=0)
S0_hat = np.cov(X_tr[y_tr == 0].T, bias=False)
S1_hat = np.cov(X_tr[y_tr == 1].T, bias=False)

x0e, y0e = covariance_ellipse_points(mu0_hat, S0_hat, n_std=2.0)
x1e, y1e = covariance_ellipse_points(mu1_hat, S1_hat, n_std=2.0)

fig = go.Figure()
fig.add_trace(go.Scatter(x=X_tr[:, 0], y=X_tr[:, 1], mode="markers",
                         marker=dict(color=y_tr, colorscale="Viridis", size=6, opacity=0.6),
                         name="train"))
fig.add_trace(go.Scatter(x=x0e, y=y0e, mode="lines", name="class 0 (2σ ellipse)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x1e, y=y1e, mode="lines", name="class 1 (2σ ellipse)", line=dict(width=3)))
fig.update_layout(title="Empirical covariance ellipses", width=760, height=520)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

3) QDA discriminant function (quadratic)#

For class \(k\), the log posterior (up to a shared constant) is:

\[ \delta_k(x) = \log \pi_k\; -\; \tfrac12\log|\Sigma_k|\; -\; \tfrac12 (x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) \]

Pick the class with maximum score:

\[\hat{y}(x) = \arg\max_k \delta_k(x)\]

Because of the quadratic form \((x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)\), the boundary is generally quadratic.
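This is visible even in 1D: with unequal variances, \(\delta_0(x)=\delta_1(x)\) is a quadratic equation, so the "boundary" can consist of two points. A sketch with made-up means and variances (equal priors assumed):

```python
import numpy as np

# Hypothetical 1D classes with different variances, equal priors
m0, s0 = -1.0, 1.0
m1, s1 = +1.0, 2.0

def delta(x, mu, s):
    # 1D discriminant score; the shared log prior log(1/2) is omitted
    return -np.log(s) - 0.5 * ((x - mu) / s) ** 2

# delta0(x) - delta1(x) collected as a*x^2 + b*x + c
a = -0.5 * (1 / s0**2 - 1 / s1**2)
b = m0 / s0**2 - m1 / s1**2
c = -0.5 * (m0**2 / s0**2 - m1**2 / s1**2) + np.log(s1 / s0)
roots = np.sort(np.real(np.roots([a, b, c])))
print("boundary points:", roots)  # two crossings -> a quadratic boundary
```

Between the two roots the narrow class 0 wins; far out in either tail the wide class 1 wins, because its density decays more slowly.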

4) LDA as a constrained QDA (linear)#

LDA makes one simplifying assumption:

\[\Sigma_k = \Sigma \quad \text{(shared across classes)}\]

When covariances are shared, the quadratic term in \(x\) cancels when comparing classes.

The discriminant becomes linear:

\[ \delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac12 \mu_k^\top\Sigma^{-1}\mu_k + \log \pi_k \]

So the decision boundary is a hyperplane.

Intuition:

  • LDA is “one common ellipse shape for everyone, but different centers”

  • QDA is “each class gets its own ellipse shape and orientation”
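The cancellation can be checked numerically. A sketch with a hypothetical shared covariance, where `w` and `b0` follow the linear formula above:

```python
import numpy as np

# Hypothetical shared covariance and class means (equal priors)
Sig = np.array([[1.5, 0.4], [0.4, 1.0]])
P = np.linalg.inv(Sig)
mu_c0, mu_c1 = np.array([-1.0, 0.0]), np.array([1.0, 0.5])

def delta_shared(x, mu):
    # score with shared Sigma; -0.5*log|Sigma| and the prior drop out of differences
    return -0.5 * (x - mu) @ P @ (x - mu)

# Linear coefficients predicted by the LDA formula
w = P @ (mu_c1 - mu_c0)
b0 = -0.5 * (mu_c1 @ P @ mu_c1 - mu_c0 @ P @ mu_c0)

check_rng = np.random.default_rng(0)
for _ in range(5):
    x = check_rng.normal(size=2)
    diff = delta_shared(x, mu_c1) - delta_shared(x, mu_c0)
    assert np.isclose(diff, w @ x + b0)  # the x^T P x terms cancel
print("score difference is affine in x: w =", w, "b =", b0)
```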

5) From-scratch LDA/QDA#

class ScratchLDA:
    def __init__(self, reg: float = 1e-6):
        self.reg = float(reg)

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)

        self.classes_, y_enc = np.unique(y, return_inverse=True)
        K = self.classes_.shape[0]
        n, d = X.shape

        self.priors_ = np.bincount(y_enc, minlength=K).astype(float)
        self.priors_ = self.priors_ / self.priors_.sum()

        self.means_ = np.zeros((K, d), dtype=float)
        for k in range(K):
            self.means_[k] = X[y_enc == k].mean(axis=0)

        # pooled covariance (unbiased)
        S = np.zeros((d, d), dtype=float)
        for k in range(K):
            Xk = X[y_enc == k]
            Xc = Xk - self.means_[k]
            S += Xc.T @ Xc
        S /= (n - K)

        S = S + self.reg * np.eye(d)
        self.cov_ = S
        self.precision_ = np.linalg.inv(S)

        # precompute terms for linear discriminant
        self._w_ = (self.precision_ @ self.means_.T).T  # (K,d)
        self._b_ = -0.5 * np.sum(self.means_ * self._w_, axis=1) + np.log(self.priors_ + 1e-300)
        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        return X @ self._w_.T + self._b_[None, :]

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        scores = self.decision_function(X)
        scores = scores - scores.max(axis=1, keepdims=True)
        probs = np.exp(scores)
        probs = probs / probs.sum(axis=1, keepdims=True)
        return probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]


class ScratchQDA:
    def __init__(self, reg: float = 1e-6):
        self.reg = float(reg)

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)

        self.classes_, y_enc = np.unique(y, return_inverse=True)
        K = self.classes_.shape[0]
        n, d = X.shape

        self.priors_ = np.bincount(y_enc, minlength=K).astype(float)
        self.priors_ = self.priors_ / self.priors_.sum()

        self.means_ = np.zeros((K, d), dtype=float)
        self.covs_ = np.zeros((K, d, d), dtype=float)
        self.precisions_ = np.zeros((K, d, d), dtype=float)
        self.logdets_ = np.zeros(K, dtype=float)

        for k in range(K):
            Xk = X[y_enc == k]
            self.means_[k] = Xk.mean(axis=0)
            Sk = np.cov(Xk.T, bias=False) + self.reg * np.eye(d)
            self.covs_[k] = Sk
            self.precisions_[k] = np.linalg.inv(Sk)
            self.logdets_[k] = float(np.linalg.slogdet(Sk)[1])

        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        n = X.shape[0]
        K = self.classes_.shape[0]

        scores = np.zeros((n, K), dtype=float)
        for k in range(K):
            diff = X - self.means_[k]
            maha = np.sum((diff @ self.precisions_[k]) * diff, axis=1)
            scores[:, k] = (
                np.log(self.priors_[k] + 1e-300)
                - 0.5 * self.logdets_[k]
                - 0.5 * maha
            )
        return scores

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        scores = self.decision_function(X)
        scores = scores - scores.max(axis=1, keepdims=True)
        probs = np.exp(scores)
        probs = probs / probs.sum(axis=1, keepdims=True)
        return probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]
# Fit scratch + sklearn models
scratch_lda = ScratchLDA(reg=1e-6).fit(X_tr, y_tr)
scratch_qda = ScratchQDA(reg=1e-6).fit(X_tr, y_tr)

sk_lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
sk_qda = QuadraticDiscriminantAnalysis(reg_param=1e-6).fit(X_tr, y_tr)

pred_lda_s = scratch_lda.predict(X_te)
pred_qda_s = scratch_qda.predict(X_te)

pred_lda = sk_lda.predict(X_te)
pred_qda = sk_qda.predict(X_te)

print("Scratch LDA accuracy:", accuracy_score(y_te, pred_lda_s))
print("Scratch QDA accuracy:", accuracy_score(y_te, pred_qda_s))
print("sklearn LDA accuracy:", accuracy_score(y_te, pred_lda))
print("sklearn QDA accuracy:", accuracy_score(y_te, pred_qda))

print("\nClassification report (sklearn QDA):")
print(classification_report(y_te, pred_qda, digits=3))
Scratch LDA accuracy: 0.9666666666666667
Scratch QDA accuracy: 0.9814814814814815
sklearn LDA accuracy: 0.9666666666666667
sklearn QDA accuracy: 0.9814814814814815

Classification report (sklearn QDA):
              precision    recall  f1-score   support

           0      0.985     0.978     0.981       135
           1      0.978     0.985     0.982       135

    accuracy                          0.981       270
   macro avg      0.982     0.981     0.981       270
weighted avg      0.982     0.981     0.981       270
6) Visual decision boundaries (Plotly)#

def plot_boundary(model, X, y, title: str, grid_steps: int = 260):
    x_min, x_max = X[:, 0].min() - 1.2, X[:, 0].max() + 1.2
    y_min, y_max = X[:, 1].min() - 1.2, X[:, 1].max() + 1.2

    xs = np.linspace(x_min, x_max, grid_steps)
    ys = np.linspace(y_min, y_max, grid_steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]

    proba = model.predict_proba(grid)[:, 1].reshape(xx.shape)

    fig = go.Figure()
    fig.add_trace(go.Contour(
        x=xs,
        y=ys,
        z=proba,
        colorscale="RdBu",
        opacity=0.75,
        contours=dict(showlines=False),
        colorbar=dict(title="P(class=1)"),
    ))

    fig.add_trace(go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker=dict(color=y, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
        name="data",
    ))

    fig.update_layout(title=title, width=780, height=540)
    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    return fig


fig1 = plot_boundary(scratch_lda, X_te, y_te, "Scratch LDA decision surface")
fig1.show()

fig2 = plot_boundary(scratch_qda, X_te, y_te, "Scratch QDA decision surface")
fig2.show()

fig3 = plot_boundary(sk_lda, X_te, y_te, "sklearn LDA decision surface")
fig3.show()

fig4 = plot_boundary(sk_qda, X_te, y_te, "sklearn QDA decision surface")
fig4.show()

7) When does QDA overfit?#

QDA estimates a full covariance matrix per class.

For \(d\) features, a full covariance has \(d(d+1)/2\) parameters.

So QDA can be data-hungry:

  • in low dimension with enough data → QDA can be great

  • in high dimension or small datasets → QDA can overfit

LDA shares one covariance across classes, so it has fewer parameters and is often more stable.
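To make the gap concrete, here is a small count of the Gaussian parameters each model fits for \(K\) classes and \(d\) features (prior parameters omitted; the helper names are illustrative):

```python
def n_params_lda(d, K):
    # K mean vectors plus one shared covariance (d*(d+1)/2 free entries)
    return K * d + d * (d + 1) // 2

def n_params_qda(d, K):
    # K mean vectors plus K class-specific covariances
    return K * d + K * d * (d + 1) // 2

for d in (2, 10, 50):
    print(f"d={d:3d}: LDA {n_params_lda(d, 2):5d} vs QDA {n_params_qda(d, 2):5d}")
```

For two classes at \(d=50\), QDA already fits 2650 parameters against LDA's 1375, which is why it needs more data per class to stay stable.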

def sample_gaussian_dataset(n_samples: int, seed: int = 0):
    r = np.random.default_rng(seed)

    mu0 = np.array([-2.0, -1.0])
    mu1 = np.array([+2.0, +1.0])

    Sigma0 = np.array([[1.8, 0.8], [0.8, 1.0]])
    Sigma1 = np.array([[1.0, -0.6], [-0.6, 1.6]])

    X0 = r.multivariate_normal(mu0, Sigma0, size=n_samples // 2)
    X1 = r.multivariate_normal(mu1, Sigma1, size=n_samples // 2)

    X = np.vstack([X0, X1])
    y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))
    return X, y


train_sizes = [40, 80, 140, 220, 350, 500, 800]
acc_lda = []
acc_qda = []

for n_train in train_sizes:
    Xs, ys = sample_gaussian_dataset(n_train + 400, seed=7 + n_train)
    X_tr, X_te, y_tr, y_te = train_test_split(Xs, ys, test_size=400, random_state=7, stratify=ys)

    m_lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    m_qda = QuadraticDiscriminantAnalysis(reg_param=1e-6).fit(X_tr, y_tr)

    acc_lda.append(accuracy_score(y_te, m_lda.predict(X_te)))
    acc_qda.append(accuracy_score(y_te, m_qda.predict(X_te)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=train_sizes, y=acc_lda, mode="lines+markers", name="LDA"))
fig.add_trace(go.Scatter(x=train_sizes, y=acc_qda, mode="lines+markers", name="QDA"))
fig.update_layout(title="Test accuracy vs training size", xaxis_title="# training samples", yaxis_title="accuracy", width=780, height=460)
fig.show()

8) LDA as supervised dimensionality reduction#

LDA is also used as a supervised projection technique.

It finds directions that separate classes by maximizing the ratio:

\[\frac{\text{between-class variance}}{\text{within-class variance}}\]

In scikit-learn, LinearDiscriminantAnalysis(n_components=...) can transform data into a lower-dimensional space.
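For two classes this criterion has a closed form: the best direction is proportional to \(S_w^{-1}(\mu_1 - \mu_0)\), where \(S_w\) is the within-class scatter. A from-scratch sketch on synthetic data (variable names are illustrative):

```python
import numpy as np

fisher_rng = np.random.default_rng(0)

# Two synthetic 2D classes separated along the first axis
Xa = fisher_rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
Xb = fisher_rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(200, 2))

mean_a, mean_b = Xa.mean(axis=0), Xb.mean(axis=0)

# Within-class scatter: summed centered outer products of both classes
Sw = (Xa - mean_a).T @ (Xa - mean_a) + (Xb - mean_b).T @ (Xb - mean_b)

# Fisher direction, normalized to unit length
w_fisher = np.linalg.solve(Sw, mean_b - mean_a)
w_fisher = w_fisher / np.linalg.norm(w_fisher)
print("Fisher direction:", w_fisher)  # ≈ along the first axis
```

Since the classes differ only along the first axis here, the recovered direction is nearly \([\pm 1, 0]\).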

# 3-class dataset → LDA projection to 2D
X3, y3 = make_blobs(
    n_samples=1200,
    centers=[(-2, 0, 0), (2, 0, 0), (0, 2, 2)],
    cluster_std=[1.4, 1.4, 1.4],
    random_state=7,
)

X3_tr, X3_te, y3_tr, y3_te = train_test_split(X3, y3, test_size=0.3, random_state=7, stratify=y3)

lda_proj = LinearDiscriminantAnalysis(n_components=2)
lda_proj.fit(X3_tr, y3_tr)

Z = lda_proj.transform(X3_te)

fig = px.scatter(x=Z[:, 0], y=Z[:, 1], color=y3_te.astype(str), title="LDA projection (2 components)", labels={"x": "LD1", "y": "LD2", "color": "class"})
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.update_layout(width=760, height=500)
fig.show()

print("LDA projection explained variance ratio (approx):", getattr(lda_proj, 'explained_variance_ratio_', None))
LDA projection explained variance ratio (approx): [0.5896 0.4104]

Summary#

  • LDA/QDA are generative classifiers built from Gaussian class-conditional models.

  • QDA uses class-specific covariances → more flexible, but more parameters.

  • LDA shares covariance → linear boundaries and better stability with limited data.

  • LDA can also be used for supervised dimensionality reduction.

Exercises#

  1. Generate a dataset where covariances are truly equal and compare LDA vs QDA.

  2. Increase dimensionality (e.g. \(d=20\)) and watch QDA become unstable unless you regularize.

  3. Compare Gaussian Naive Bayes (diagonal covariance) vs LDA (full covariance).